Abstract



When using item response theory (IRT) models in educational 
and psychological measurement, it is standard practice to estimate
the operating characteristics of test items from examinees' item 
responses alone. This is the final report of a project that 
employed Bayesian and empirical Bayesian methods to exploit 
additional information that is often available about test items 
(e.g., format, content, or cognitive processing requirements) or 
about examinees (e.g., educational background or demographic 
status). Practical and theoretical results obtained in a series 
of research reports are summarized. 

Key words: Bayesian Estimation, Collateral Information,
Differential Strategies, Empirical Bayes Estimation, Information
Matrices, Item Response Theory, Missing Data
Introduction



Item response theory (IRT) models in psychometrics give the 
probability that an examinee will respond correctly to a given 
test item in terms of parameters for just that examinee and that 
item. This formulation makes it possible to solve many practical 
measurement problems that are difficult or intractable under 
classical test theory, including adaptive ability testing, large 
population equating studies, and test construction to targeted 
operating specifications. 

It is standard practice to estimate IRT item parameters 
solely from the observed responses of a sample of examinees. This 
project was motivated by a desire to improve estimation by 
exploiting collateral information that is often available about 
test items (e.g., format, content, or cognitive processing 
requirements) or about examinees (e.g., educational background or 
demographic status). Table 1 lists the reports from the project 
exploring both practical and theoretical aspects of the problem. 
The present report summarizes the main results. The interested 
reader is referred to the individual papers for details, 
derivations, and examples. 

Table 1 about here 
Incorporating Collateral Information into IRT



The initial thrusts of the project were to determine how to 
incorporate collateral information into estimation procedures when 
the IRT model is correct, and to gauge its impact on estimation 
precision. Bayesian and empirical Bayesian methods were employed 
to this end. This section describes the basic model (Mislevy,
1987, in press).

Under an IRT model, the probability of response $x_j$ to Item j
with a possibly vector-valued item parameter $\beta_j$ from an examinee
with proficiency parameter $\theta$ is given as

$$P(x_j \mid \theta, \beta_j) = f(x_j \mid \theta, \beta_j) , \qquad (1)$$



where the form of the item response function f is known up to the
item parameters. Under the usual assumption of local
independence, the conditional probability of the response pattern
$x = (x_1, \ldots, x_n)$ to n test items is simply the product of
expressions like (1):

$$P(x \mid \theta, \beta) = \prod_j P(x_j \mid \theta, \beta_j) , \qquad (2)$$



where $\beta = (\beta_1, \ldots, \beta_n)$. Let the data matrix $X = (x_1, \ldots, x_N)$
represent response vectors observed from a sample of N examinees
from a population in which $\theta$ follows the density $p(\theta)$. The
likelihood for $\beta$ induced by X is obtained as

$$L_x(\beta \mid X) = \prod_i \int f(x_i \mid \theta, \beta) \, p(\theta) \, d\theta . \qquad (3)$$

"Marginal maximum likelihood" (MML) estimates of item parameters 
(e.g., Bock and Aitkin, 1981) are obtained by maximizing (3) with 
respect to $\beta$.
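
As an illustration of (1)-(3), the sketch below evaluates the marginal log-likelihood numerically. It is a minimal example under assumptions that the report does not make: the Rasch model stands in for the item response function f, p(θ) is taken to be standard normal, and the integral is approximated by a sum over an evenly spaced grid. All function names are hypothetical.

```python
import math

def rasch_prob(x, theta, b):
    """f(x | theta, b) under the Rasch model (assumed here for illustration)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p if x == 1 else 1.0 - p

def normal_pdf(theta, mean=0.0, sd=1.0):
    """Normal density, used as the assumed population density p(theta)."""
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def pattern_prob(x, theta, betas):
    """Equation (2): product over items under local independence."""
    prob = 1.0
    for x_j, b_j in zip(x, betas):
        prob *= rasch_prob(x_j, theta, b_j)
    return prob

def marginal_loglik(X, betas, lo=-6.0, hi=6.0, n_points=121):
    """Log of equation (3), with theta ~ N(0, 1) integrated out on a grid."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = [normal_pdf(t) * step for t in grid]
    total = 0.0
    for x in X:
        integral = sum(pattern_prob(x, t, betas) * w for t, w in zip(grid, weights))
        total += math.log(integral)
    return total

# Three examinees responding to two items with difficulties -0.5 and 0.5.
X = [[1, 0], [1, 1], [0, 0]]
print(marginal_loglik(X, betas=[-0.5, 0.5]))
```

MML estimation would pass a function like this to a numerical optimizer over the item parameters; the optimizer itself is omitted here.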

Suppose that in addition to item responses, values of 
collateral variables y are also available from examinees. The 
appropriate marginal likelihood is now 

$$L_{xy}(\beta \mid X, Y) = \prod_i \int f(x_i \mid \theta, \beta) \, p(\theta \mid y_i) \, d\theta . \qquad (4)$$

MML estimates of item parameters that exploit collateral
information about examinees are obtained by maximizing (4) with
respect to $\beta$ (Mislevy, 1987).
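
The only change from (3) to (4) is that each examinee's integral uses a density conditioned on that examinee's collateral value. The sketch below makes this concrete under assumed forms: a Rasch response function and a normal conditional density with mean `slope * y` and standard deviation `resid_sd`. These names and the linear-regression form are illustrative assumptions, not the report's specification.

```python
import math

def rasch_prob(x, theta, b):
    """Rasch response probability (an assumed form for f)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p if x == 1 else 1.0 - p

def normal_pdf(theta, mean, sd):
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def cond_marginal_loglik(X, Y, betas, slope, resid_sd,
                         lo=-8.0, hi=8.0, n_points=321):
    """Log of equation (4) with p(theta | y) = N(slope * y, resid_sd**2)."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    total = 0.0
    for x, y in zip(X, Y):
        integral = 0.0
        for t in grid:
            like = 1.0
            for x_j, b_j in zip(x, betas):
                like *= rasch_prob(x_j, t, b_j)
            # The examinee-specific density replaces the common p(theta).
            integral += like * normal_pdf(t, slope * y, resid_sd) * step
        total += math.log(integral)
    return total

# Two examinees with collateral values y = -1 and y = 1.
X = [[0, 0], [1, 1]]
Y = [-1.0, 1.0]
print(cond_marginal_loglik(X, Y, betas=[-0.5, 0.5], slope=0.8, resid_sd=0.7))
```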

Bayesian item parameter estimates are obtained from posterior 
distributions for $\beta$, which arise as the normalized product of a
likelihood function such as (3) or (4) and a prior distribution
for $\beta$, say $g(\beta)$. If, before observing data, one possesses no
information to differentiate expectations about the parameters of
different items, an exchangeable prior for $\beta$ is appropriate; that
is, the items are modeled as if they were n random draws from the
same distribution. In this case the posterior distribution is
given by

$$p_x(\beta \mid X) \propto L_x(\beta \mid X) \prod_j g(\beta_j) \qquad (5)$$

or

$$p_{xy}(\beta \mid X, Y) \propto L_{xy}(\beta \mid X, Y) \prod_j g(\beta_j) , \qquad (6)$$



depending on whether collateral information is available about 
examinees. If values on the collateral variable z are 
additionally available about items, they are incorporated as 



$$p_{xz}(\beta \mid X, Z) \propto L_x(\beta \mid X) \prod_j g(\beta_j \mid z_j) \qquad (7)$$

or

$$p_{xyz}(\beta \mid X, Y, Z) \propto L_{xy}(\beta \mid X, Y) \prod_j g(\beta_j \mid z_j) \qquad (8)$$



(Mislevy, in press). Standard Bayesian procedures for estimating 
item and population parameters that do not employ collateral 
information extend to (7) and (8) in a straightforward manner 
(Mislevy, 1987, in press). 
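
To show how a posterior such as (7) is assembled, the sketch below adds the log of an item-level prior to the marginal log-likelihood of (3). It assumes a Rasch response function, a standard normal p(θ), and a normal prior for each item difficulty whose mean is a linear function of a single item feature, with hypothetical hyperparameters `gamma0`, `gamma1`, and `tau` treated as known. None of these specific choices comes from the report.

```python
import math

def rasch_prob(x, theta, b):
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p if x == 1 else 1.0 - p

def normal_pdf(v, mean=0.0, sd=1.0):
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def marginal_loglik(X, betas, lo=-6.0, hi=6.0, n_points=121):
    """Log of equation (3) with theta ~ N(0, 1), integrated on a grid."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    total = 0.0
    for x in X:
        integral = 0.0
        for t in grid:
            like = 1.0
            for x_j, b_j in zip(x, betas):
                like *= rasch_prob(x_j, t, b_j)
            integral += like * normal_pdf(t) * step
        total += math.log(integral)
    return total

def log_item_prior(betas, Z, gamma0, gamma1, tau):
    """Sum of log g(beta_j | z_j), with g = N(gamma0 + gamma1 * z_j, tau**2)."""
    return sum(math.log(normal_pdf(b, gamma0 + gamma1 * z, tau))
               for b, z in zip(betas, Z))

def log_posterior(X, betas, Z, gamma0, gamma1, tau):
    """Log of the right-hand side of equation (7), up to an additive constant."""
    return marginal_loglik(X, betas) + log_item_prior(betas, Z, gamma0, gamma1, tau)

X = [[1, 0], [1, 1], [0, 0]]
Z = [1.0, 3.0]  # one feature per item, e.g. a count of solution steps
print(log_posterior(X, [0.5, 1.5], Z, gamma0=0.0, gamma1=0.5, tau=1.0))
```

Replacing `marginal_loglik` with a version of (4) would give the corresponding sketch for (8).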

Increase in Information: Theoretical Results 
Using general results about missing data problems, such as 
Orchard and Woodbury's (1972) "missing information principle," it
is possible to derive upper and lower bounds for the expected
precision of item parameter estimates with and without collateral
information (Mislevy and Sheehan, 1988, in press). The results
are expressed most easily in Bayesian terms.

Consider first the impact of collateral information about
examinees. Let $V(\beta \mid \theta, X, Y)$ represent the posterior variance of $\beta$
that would be obtained after observing values of not only item
responses x and collateral variables y from a sample of N
examinees, but values of their latent proficiencies $\theta$ as well.
Let analogous expressions represent posterior variance of $\beta$ when
values of one or more types of variables are not observed; for
example, $V(\beta \mid X)$ when only item responses are observed. The
following relationships may be derived:

$$E[V(\beta \mid \theta, X, Y)] = E[V(\beta \mid \theta, X)] \le E[V(\beta \mid X, Y)] \le E[V(\beta \mid X)] ,$$

where $A \le B$ means that the matrix difference B-A is at least
positive semidefinite. Thus the precision of item parameter
estimation when using collateral information about examinees along
with item responses is at least as great as that expected when
using item responses alone, but cannot exceed the precision that
would be expected with the same sample size if values of the
latent variable $\theta$ could be observed as well.

An obvious lower bound holds for the impact of collateral
information about items:

$$E[V(\beta \mid X, Z)] \le E[V(\beta \mid X)] ;$$

that is, expected precision when using collateral information
about items in addition to item responses equals or exceeds
precision expected when not using it. No ordering holds between
$E[V(\beta \mid X, Z)]$ and $E[V(\beta \mid \theta, X)]$ in general. In particular, when Z is
employed along with X, it is possible to exceed the precision
obtainable with $\theta$ and X.

Increase in Information: Practical Results 

By examining the structure of information matrices with and 
without collateral information, and by applying the methods to 
data from the National Assessment of Educational Progress (NAEP) 
and the Profile of American Youth surveys, it was found that 
modest increases in the precision of item parameter estimates can 
be achieved by using collateral information (Mislevy, 1987, in 
press; Mislevy and Sheehan, 1988, in press). 

From collateral information about examinees, increases in 
information depend on the strength of the relationship of the 
collateral variables with $\theta$. In typical educational and
psychological settings where collateral information can often 
account for about a third of the population variance, and with 
item reliabilities typical of those settings, gains equivalent to 
2 to 6 additional test items can be expected. This gain is 
substantial when few responses are available from each examinee, 
as in educational assessments, and may be useful in adaptive 
testing where tests are short but well-targeted. It is
unimpressive in individual achievement testing, where tests of
sixty items or more are common. 

From collateral information about items, increases equivalent
to hundred and fifty additional examinees were found for Rasch 
item difficulty parameters in a junior high fractions test 
(Mislevy, in press) . While a gain of this magnitude would be 
unimpressive in applications where data from thousands of 
examinees is already at hand, it is meaningful in situations when 
either (1) few examinees have been tested, as in the fractions 
example or in local testing problems, or (2) no examinees have 
been tested, as when approximating item statistics for newly
written test items.

In addition to small-sample applications, collateral 
information about items can play an important role in both item 
construction and diagnosis regardless of sample size. The 
conditional distributions of item parameters, $p(\beta \mid z)$, express item
operating characteristics such as difficulty in terms of salient 
features of the items. To the degree that these distributions 
succeed in explaining item operating characteristics, the test 
constructor can manipulate the features to modify items in 
intended ways or to create new items that tap the same essential 
skills. To the degree that items depart from the centers of these 
predictive distributions, they are hard or easy for reasons other 
than those held most important in describing the domain. Outliers 
are suspect as flawed or irrelevant. The approach implied by (5) 
and (6) is a step in the direction of integrating educational and 
psychological theory into the measurement process. (Its
application to the items in the Document Utilization scale of the
NAEP Survey of Adult Literacy is currently in progress.)

When Collateral Information Must Be Used 

The preceding sections discuss how, when all examinees are 
presented all items, collateral information about examinees and 
items may be exploited to obtain more precise item parameter 
estimates. Consistent estimates are still obtained in this case 
if the collateral information is not used (Mislevy and Sheehan, 
in press). The same results apply when each examinee receives
only a random subset of items. 

This is not the situation in many practical
applications of IRT, however. In order to obtain more information
about item or examinee parameters per observed response, items are
often administered to examinees as a function of item and examinee 
collateral variables. Fourth grade students may be presented an 
easier test form than the overlapping form fifth graders receive, 
for example; and a high school graduate may be presented a harder 
item first in an adaptive test than a nongraduate. In order to 
obtain consistent MML item parameter estimates, it is mandatory to 
employ collateral information about examinees--i.e., to use (4)
rather than (3) (Mislevy and Sheehan, in press). In order to
obtain the correct Bayesian inferences, it is mandatory to use
collateral information about items as well--i.e., to base
inferences on (8) rather than (4) (Mislevy and Wu, 1988). Mislevy
and Sheehan (in press) give a simple counterexample with the Rasch
model to demonstrate an asymptotic bias in item parameter 
estimation in such a case if collateral information is ignored. 

Modeling Item Responses when Different Examinees 
Follow Different Solution Strategies 

Initial work on using collateral information about items 
assumed that the IRT model was strictly correct. Thinking about 
the features of items that made them easy or hard, however, made 
it clear that difficulty depends on the way that the examinees are 
attempting to arrive at their answers. In particular, different 
features of items can make them differentially difficult for 
examinees who follow different solution strategies. This insight 
led to the formulation of a mixture of IRT models (Mislevy and 
Verhelst, in press). Resolving the mixture demands a type of
collateral information that plays no role whatsoever in 
traditional psychometrics, including standard IRT: psychological 
theory about the different strategies that examinees might follow. 

The key idea is to model item difficulty in terms of salient 
item features--features that tend to make an item easy or
difficult under various strategies. The Mislevy-Verhelst model
makes the following assumptions: 

1. A finite number of known solution strategies apply. 

2. Each examinee applies the same one of these strategies
to all the items in the set.

3. The responses of an examinee are observed but the strategy
he or she has employed is not.

4. The responses of examinees following Strategy k conform to
an item response model of a known form.

5. Substantive theory posits relationships between observable
features of items and the probabilities of success enjoyed by
members of each strategy class. The relationships may be
known either fully or only partially--e.g., known as to
parametric form but not parameter values.

Let $\theta = (\theta_1, \ldots, \theta_K)$ be an examinee proficiency parameter,
with the element $\theta_k$ corresponding to proficiency if Strategy k is
employed. Let $\phi = (\phi_1, \ldots, \phi_K)$ be an examinee strategy parameter,
with all elements zero except for the single element $\phi_k$
corresponding to the strategy that is employed; this element takes
the value 1. Let the operating characteristics of Item j under
Strategy k be given as follows:

$$P(x_j \mid \theta, z_{jk}, \alpha, \phi_k = 1) = f_k[x_j \mid \theta_k, \beta_{jk}(z_{jk} \mid \alpha)] , \qquad (9)$$

where $\beta_{jk}(z_{jk} \mid \alpha)$, the item parameter for Item j that applies when
examinees follow Strategy k, depends on its salient features $z_{jk}$
under that strategy and a relatively small number of basic
strategy parameters $\alpha$. The MML function for estimating $\alpha$ induced
by the data matrix X from a sample of N examinees and the
item/strategy collateral variables Z is obtained as

$$L(\alpha \mid X, Z) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \int \prod_{j=1}^{n} f_k[x_{ij} \mid \theta, \beta_{jk}(z_{jk} \mid \alpha)] \, g_k(\theta) \, d\theta , \qquad (10)$$

where $g_k$ is the density of $\theta_k$ among those examinees following
Strategy k, and $\pi_k$ is the proportion of the population who do so.
If the $g_k$s and the $\pi_k$s are not known, they too can be estimated via
MML by maximizing (10) with respect to them as well.
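
The structure of (10) can be sketched in a few lines: for each examinee, the response-pattern likelihood under each strategy is integrated over proficiency and the results are averaged with the strategy proportions as weights. The example assumes a Rasch form for every strategy, normal proficiency densities, and item difficulties that are simply `alpha[k] * Z[j][k]`; these are illustrative assumptions with hypothetical names, not the specification used by Mislevy and Verhelst.

```python
import math

def rasch_prob(x, theta, b):
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p if x == 1 else 1.0 - p

def normal_pdf(theta, mean, sd):
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pattern_prob(x, Z, alpha, pis, means, sds,
                         lo=-8.0, hi=8.0, n_points=161):
    """One examinee's factor in equation (10).

    Z[j][k] is the feature of Item j relevant under Strategy k; the item
    difficulty is taken to be alpha[k] * Z[j][k] (an assumed form).
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    total = 0.0
    for k, pi_k in enumerate(pis):
        betas_k = [alpha[k] * z_j[k] for z_j in Z]
        integral = 0.0
        for t in grid:
            like = 1.0
            for x_j, b_jk in zip(x, betas_k):
                like *= rasch_prob(x_j, t, b_jk)
            integral += like * normal_pdf(t, means[k], sds[k]) * step
        total += pi_k * integral
    return total

def mixture_loglik(X, Z, alpha, pis, means, sds):
    """Log of equation (10): sum of log pattern probabilities over examinees."""
    return sum(math.log(mixture_pattern_prob(x, Z, alpha, pis, means, sds))
               for x in X)

# Two items, two strategies; each item is easy under one strategy and hard under the other.
Z = [[1.0, -1.0], [-1.0, 1.0]]
X = [[1, 0], [0, 1], [1, 1]]
print(mixture_loglik(X, Z, alpha=[1.5, 1.5], pis=[0.6, 0.4],
                     means=[0.0, 0.5], sds=[1.0, 1.0]))
```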

If the $\alpha$s, $g_k$s, and $\pi_k$s are known or well estimated, it is
possible to calculate for a given examinee the probability that
his response vector was produced under a given strategy and to
estimate his ability under each possibility. By Bayes theorem,
the posterior probability of Strategy k and proficiency $\theta$ under
that strategy is obtained as

$$P(\theta, \phi_k = 1 \mid x) = C \, f_k[x \mid \theta, \beta_k(z_k)] \, g_k(\theta) \, \pi_k ,$$

where C is the normalizing constant obtained as

$$C^{-1} = \sum_{k'=1}^{K} \pi_{k'} \int f_{k'}[x \mid \theta, \beta_{k'}(z_{k'})] \, g_{k'}(\theta) \, d\theta .$$

The posterior probability that Strategy k was employed is

$$P(\phi_k = 1 \mid x) = \int P(\theta, \phi_k = 1 \mid x) \, d\theta ,$$

and the posterior mean proficiency conditional on $\phi_k = 1$ (i.e.,
supposing that Strategy k was used) is

$$E(\theta_k \mid x, \phi_k = 1) = \int \theta \, P(\theta, \phi_k = 1 \mid x) \, d\theta \; P^{-1}(\phi_k = 1 \mid x) .$$
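
These posterior quantities can be computed directly once the model parameters are fixed. The sketch below returns, for one response vector, the posterior probability of each strategy and the posterior mean proficiency given that strategy. It assumes Rasch response functions, normal proficiency densities, and a simple grid approximation to the integrals; the function name and inputs are hypothetical.

```python
import math

def rasch_prob(x, theta, b):
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p if x == 1 else 1.0 - p

def normal_pdf(theta, mean, sd):
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def strategy_posterior(x, betas_by_strategy, pis, means, sds,
                       lo=-8.0, hi=8.0, n_points=321):
    """Return [(P(phi_k = 1 | x), E(theta_k | x, phi_k = 1)) for each k]."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    mass, first_moment = [], []
    for k, betas_k in enumerate(betas_by_strategy):
        m0 = 0.0  # integral of f_k * g_k * pi_k over theta
        m1 = 0.0  # integral of theta * f_k * g_k * pi_k over theta
        for t in grid:
            like = 1.0
            for x_j, b_jk in zip(x, betas_k):
                like *= rasch_prob(x_j, t, b_jk)
            w = like * normal_pdf(t, means[k], sds[k]) * pis[k] * step
            m0 += w
            m1 += t * w
        mass.append(m0)
        first_moment.append(m1)
    norm = sum(mass)  # this is 1 / C
    return [(m0 / norm, m1 / m0) for m0, m1 in zip(mass, first_moment)]

# Item 1 is easy and Item 2 hard under Strategy 1; the reverse under Strategy 2.
post = strategy_posterior([1, 0], [[-2.0, 2.0], [2.0, -2.0]],
                          pis=[0.5, 0.5], means=[0.0, 0.0], sds=[1.0, 1.0])
for k, (prob, mean) in enumerate(post, start=1):
    print(f"Strategy {k}: posterior probability {prob:.3f}, posterior mean {mean:.3f}")
```

In this toy case a correct response to the first item and an incorrect response to the second is far more likely under Strategy 1, so that strategy receives most of the posterior probability.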



The significance of this model lies in its ability to express
how examinees solve items rather than just how many they solve.
The latter is all that the standard models of test theory can do.
Areas of potential benefit include psychological investigations of 
alternative processing models, educational decisions involving 
level of understanding, and determinations of alternative mental 
models in problem solving. The approach opens the door to such 
applications as (1) adaptive testing schemes designed to infer how 
examinees solve problems as well as how well they solve them, and 
(2) studies of changes in the structure as well as the level of 
intelligence in the course of human development. 

Inferring Examinee Ability When Some Item Responses Are Missing 

In practical applications of item response theory (IRT), 
there are several reasons that item responses may not be observed 
from all examinees to all test items. The reason most germane to 
the collateral information problem is the intentional 
administration of only subsets of items to examinees, with the 
subset depending on collateral information. It was mentioned 
above that collateral information must be taken into account in 
these cases. In addition to this type of missingness, Mislevy and 
Wu (1988) studied problems of inference that arise with several 
other types of missingness that arise frequently in IRT. 

To preface the results of their study, we review Rubin's 
(1976) notions about "ignorability" of missing data. Ignoring the
missingness process under direct likelihood inference means using
a pseudo-likelihood that includes terms for only the responses
that were observed, without regard for the processes by which they
came to be observed. The resulting inferences are appropriate if
the pseudo-likelihood is proportional to the correct likelihood
that does account for the missingness process. In this case the
correct value of the maximum likelihood estimate (MLE) is
obtained. Sampling-distribution inferences based on the MLE are
appropriate only if the missingness pattern does not depend on the
values of the observed data. When this condition holds,
sampling-distribution inferences can be drawn with regard to repeated
samples of responses to only those items whose responses were
observed. The missingness process is ignorable with respect to
Bayesian inference if the correct Bayesian posterior is
proportional to the product of the pseudo-likelihood and an
appropriate prior distribution.
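
In computational terms, ignoring the missingness process means that unobserved responses contribute nothing to the likelihood. The sketch below shows this for inference about proficiency with known item parameters, assuming a Rasch model and coding missing responses as `None`; it also locates the maximizing value on a coarse grid. Whether the resulting estimate is appropriate depends on the missingness process, which the code does not model. All names are hypothetical.

```python
import math

def rasch_prob(x, theta, b):
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p if x == 1 else 1.0 - p

def pseudo_loglik(theta, x, betas):
    """Log pseudo-likelihood: only observed responses (not None) contribute."""
    total = 0.0
    for x_j, b_j in zip(x, betas):
        if x_j is None:
            continue  # the missingness process is ignored
        total += math.log(rasch_prob(x_j, theta, b_j))
    return total

def grid_mle(x, betas, lo=-4.0, hi=4.0, n_points=801):
    """Grid search for the theta that maximizes the pseudo-likelihood."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    return max(grid, key=lambda t: pseudo_loglik(t, x, betas))

# The second item was not presented; items 1 and 3 have known difficulty 0.
x = [1, None, 0]
betas = [0.0, 1.0, 0.0]
print(grid_mle(x, betas))
```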

For five common types of missingness in IRT, Mislevy and Wu
first used Rubin's (1976) theorems to determine whether
ignorability holds under direct likelihood and Bayesian inference
about examinee parameters $\theta$ when item parameters $\beta$ are known. In
those cases in which the correct value of the MLE is obtained
under direct likelihood inference, they asked whether sampling
distribution inferences based on the MLE were appropriate. They
then considered the analogous questions for inferences about $\beta$
when the examinee parameters are eliminated by marginalization, as
in (3)-(8). The findings are summarized below. Tables 2 and 3
highlight the results on ignorability.



Tables 2 and 3 about here 



Case 1: Alternate Test Forms. When an examinee is assigned
one of several alternative test forms by a random process such as
a coin flip or a spiralling scheme, the process that renders
missing the responses to items on the forms not presented is
ignorable for all three types of inference, both for estimating $\beta$
and for estimating $\theta$ when $\beta$ is known.

Case 2: Targeted Testing. When collateral variables such as
educational or demographic status are used to assign an examinee
one of several test forms that differ in their measurement
properties, the resulting missingness on forms not given is
ignorable under direct likelihood inference for $\theta$ given $\beta$, but not
under Bayesian inference unless the prior information about
examinees that led to differential assignments is conditioned on.
This information must be taken into account for both likelihood
and Bayesian inferences about $\beta$; for Bayesian inference, prior
information about $\beta$ used to select items must additionally be
taken into account. Sampling distribution inferences may be based
on MLEs for $\beta$ and for $\theta$ given $\beta$, conditional on the observed
patterns of form administration within values of the examinee
variables used for targeting.



It should be emphasized that these conclusions depend on the 
veracity of the IRT model. In particular, it is necessary that 
the regression of a correct response on ability be invariant with 
respect to collateral information. This assumption may well fail 
in a situation of currently increasing interest: An item pool is
calibrated using an IRT model, and a school is allowed to measure
students using only those items it deems relevant to its
curriculum. If students from different schools have had different
opportunities to learn the skills tapped by different items, then
tailoring tests to their strengths leads almost certainly to item
by school by ability interactions--a violation of the IRT model.
Estimates for schools and individuals within schools tend to
overestimate the scores they would have received had they been
given all items, or randomly selected subsets of items. This use
of IRT may hold practical value nonetheless, provided that such
scores are viewed not as consistent estimates of performance in
the total pool but as indicators of a kind of maximal performance.

Case 3: Adaptive Testing. In adaptive testing, item
assignment proceeds item by item for each examinee according to
the values of his responses to preceding items. The same
conclusions as for Case 2 hold for direct likelihood and Bayesian
inference. Ignorability under direct likelihood inference means
that the correct points are identified as MLEs of $\theta$ given $\beta$ and of
$\beta$. The usual MLE properties under sampling-distribution inference
need not hold, however, because the probabilities of missingness
patterns depend on the values of observed responses.

Case 4: Not-reached Items. When some examinees run out of
time before they see the last items on a nearly nonspeeded test,
the not-reached process is ignorable with respect to direct
likelihood inference about $\theta$ given $\beta$, and the MLE supports
sampling distribution inferences that pertain to repeated
administrations of the items that were actually reached. This
missingness process is not ignorable under Bayesian inference
unless speed and ability are independent. And only then can
direct likelihood inferences about $\beta$ ignore the missingness.
Furthermore, Bayesian inferences about $\beta$ require that collateral
variables for items be employed if they played a role in
determining which items would not be reached, as when items are
ordered from easy to hard.

Case 5: Intentional Omission. When examinees are presented
items, have a chance to appraise their content, and decide for 
their own reasons not to respond, the missingness is not 
ignorable. Inferences must be drawn from a full model for the 
joint distribution of missingness and item response. 

Not surprisingly, modeling this nonignorable nonresponse is 
difficult. Neither of the two most ambitious approaches proposed 
to date, namely Lord's (1983) model for omits and the use of 
multiple-category IRT models (e.g., Bock, 1972), handles the issue 
of local independence in a fully satisfactory manner. Under 
Lord's (1983) model, the marginal model for item responses is not 
a standard IRT model depending on $\theta$ alone and exhibiting local
independence. Under the multiple-category model approach, local
independence fails unless all examinees at any given ability level
have the same propensity to omit items they are unsure of, rather 
than guess at random. 

If one assumes that examinees are perfect judges of their 
chances of responding correctly, and omit only if it is in 
accordance with the strategy that maximizes their expected score, 
Lord's (1974) treatment of omits as fractionally correct can be 
justified as providing the expectation of a conditional term in 
the full likelihood for omission probabilities and correct-response
probabilities. This procedure is readily incorporated
into standard complete-data IRT algorithms and avoids having to
specify the full likelihood, but sacrifices information about
examinee and item parameters conveyed by the observed pattern of
missingness. Given the complexity of models for the full
likelihood, however, this expedient seems to be a good practical
choice--provided that, as Lord urges, examinees are clearly
informed about how omits will be scored and which omitting 
strategy maximizes their chances of scoring well. 
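
The fractional-credit idea fits into a complete-data log-likelihood with one change: an omitted response enters with a score between 0 and 1 instead of being dropped. The sketch below assumes that the fractional score is the reciprocal of the number of response options and that items follow a three-parameter logistic model with the same value as lower asymptote. Both choices are assumptions made for illustration and should be checked against Lord (1974); the function names are hypothetical.

```python
import math

def three_pl(theta, a, b, c):
    """Three-parameter logistic response probability (assumed form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fractional_loglik(theta, responses, items, n_options):
    """Log-likelihood with omits (None) scored as fractionally correct.

    responses: 1 (correct), 0 (incorrect), or None (omitted).
    items: list of (a, b) pairs; the lower asymptote is 1 / n_options.
    """
    c = 1.0 / n_options
    total = 0.0
    for r, (a, b) in zip(responses, items):
        u = c if r is None else float(r)  # omit receives credit c (assumption)
        p = three_pl(theta, a, b, c)
        total += u * math.log(p) + (1.0 - u) * math.log(1.0 - p)
    return total

items = [(1.0, 0.0), (1.2, 0.5)]
print(fractional_loglik(0.3, [1, None], items, n_options=4))
```

Because the omitted item's contribution is a weighted average of the contributions it would make if answered correctly or incorrectly, the result always lies between those two cases.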

Conclusion 

Although collateral information about examinees and items is 
rarely employed in item response theory (IRT), it is straightforward
to incorporate it using Bayesian and empirical Bayesian
methods. If the IRT model is correct and examinees are assigned 
items independently of values on collateral variables, then 
collateral information can be used to improve item parameter 
estimation modestly. Employing collateral information is 
mandatory to obtain correct Bayesian and empirical Bayesian
inferences if it was used to assign items to examinees. 

Aside from considerations of efficiency, employing collateral 
information about items is a step toward integrating educational 
and psychological theory into the measurement process. Two
aspects of this idea were developed in the course of the project. 

The first, which takes a more traditional measurement 
perspective, assumes that a single IRT model provides an 
acceptable fit to the data of interest. Modeling items' operating 
characteristics in terms of salient features can make estimation 
more precise, but more importantly it elucidates the reasons that 
items are hard or easy, and why some are more discriminating than 
others. A formal framework is thus available for item 
construction and diagnosis, expressing relationships among 
substantive theory, item features, and measurement properties. 

The second is a response to a growing awareness of the fact 
that traditional psychometric models (IRT as well as classical 
test theory) measure what is essentially an overall level of 
proficiency--losing in the process qualitative differences among
examinees that arise from different cognitive solution strategies. 
In order to extend psychometric analysis to these problems, and to 
bring to bear the findings of recent research upon applied 
measurement problems, it is mandatory to employ collateral 
information about examinees and items that bears upon the ways 
that people solve problems. A mixture of IRT models that applies 
to some problems of this type was introduced in the project. 
Table 2

Ignorability Results for Estimating $\theta$ Given $\beta$

Type of                               Type of Inference
Missingness     Direct Likelihood   Bayesian                    Sampling Distribution
-------------------------------------------------------------------------------------
Alternate       Yes                 Yes                         Yes
Forms

Targeted        Yes                 Yes, given                  Yes
Forms                               examinee variables

Adaptive        Yes                 Yes, given                  No
Testing                             examinee variables
                                    if they are used

Not-Reached     Yes                 No, unless speed and        Yes
                                    ability are independent

Intentional     No                  No                          No
Omissions
-------------------------------------------------------------------------------------
Conditional on the observed pattern of missingness.

Table 3

Ignorability Results for Estimating $\beta$ After Marginalizing over $\theta$

Type of                               Type of Inference
Missingness     Direct Likelihood     Bayesian                Sampling Distribution
-------------------------------------------------------------------------------------
Alternate       Yes                   Yes                     Yes
Forms

Targeted        Yes, given            Yes, given              Yes, given
Forms           examinee variables    examinee and item       examinee variables
                                      variables

Adaptive        Yes, given            Yes, given              No
Testing         examinee variables    item variables and
                if they are used      examinee variables
                                      if they are used

Not-Reached     No, unless speed      No, unless speed        No, unless speed
                and ability are       and ability are         and ability are
                independent           independent             independent

Intentional     No                    No                      No
Omissions
-------------------------------------------------------------------------------------
Conditional on the observed pattern of missingness.